[Quantization] add humming kernel support for deepseek v4 #24289
jinzhen-lin wants to merge 11 commits into
Conversation
Code Review
This pull request introduces the "Humming" quantization backend and MoE runner, adding optimized Triton and CUDA kernels for specialized quantization formats like MXFP4. The feedback highlights critical issues such as a potential memory leak in runner registration, possible out-of-bounds memory access in the Triton kernel, and problematic in-place configuration modifications. Additionally, the review suggests fixing a typo in attribute mapping, removing redundant rounding operations, and handling variable data type sizes more accurately during memory allocation.
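One of the flagged issues, the in-place configuration modification, can be illustrated with a minimal sketch (the function and dict keys below are hypothetical, not from the PR): when several layers share one quantization config object, per-layer overrides should be applied to a copy rather than mutating the shared instance.

```python
import copy

def apply_overrides(shared_config: dict, overrides: dict) -> dict:
    """Hypothetical helper: return a per-layer config without
    mutating the shared dict that other layers still reference."""
    cfg = copy.deepcopy(shared_config)  # never modify the caller's object
    cfg.update(overrides)
    return cfg

shared = {"fmt": "mxfp4", "group_size": 32}
layer_cfg = apply_overrides(shared, {"group_size": 64})
```

Mutating `shared` directly would silently change the config seen by every other layer that holds a reference to it.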
cc @Fridge003 for hopper w4a16 kernels
Hello, can DeepSeek-V4 Pro use the humming kernel?
It should be supported, but I haven't actually run it myself yet. You're welcome to try it and share feedback.
@jinzhen-lin Hi, here are fixes:
Fix applying the 2604B SwiGLU clamp/checker path jinzhen-lin#2
Fix the DeepEP empty-token path error jinzhen-lin#3
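The SwiGLU clamp fix referenced above (jinzhen-lin#2) targets a common failure mode in low-precision MoE kernels: an unbounded gate activation overflows the narrow intermediate dtype. A minimal sketch of the idea, with an illustrative clamp limit that is not the value from the actual kernel:

```python
import math

def swiglu_clamped(gate: float, up: float, limit: float = 7.0) -> float:
    """Clamped SwiGLU sketch: bound the gate before SiLU so the
    low-precision intermediate cannot blow up. The 7.0 limit is
    illustrative only."""
    g = max(-limit, min(limit, gate))
    silu = g / (1.0 + math.exp(-g))  # SiLU(x) = x * sigmoid(x)
    return silu * up
```

Without the clamp, a large `gate` value would push `silu * up` outside the representable range of an fp16/fp8 intermediate, which is one plausible source of garbled output tokens.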
Bug: Garbled token insertion with Humming MXFP4 + DeepEP on H200

Environment:
Symptom: Occurs in both the thinking chain and the final content (independently, never both in the same request).
Trigger:
Ruled out:
Hypothesis:
@txh1873749380 Which specific commit are you using? Does it include commit ff72f25? I suspect it might be the SwiGLU clamp issue, but that should have been fixed already.
@jinzhen-lin Agreed. I suspect the SwiGLU clamp too. Working on a proper fix.
@jinzhen-lin I'm on a commit from before last week, so I likely don't have ff72f25. Checked the recent changes and they overlap almost exactly with what I'm working on. Still figuring out the right clamp fix. |
This PR adds Humming kernels to SGLang. It is based on #23754, adding and improving support for DeepSeek V4 on top of it.
Humming Kernels: https://github.com/inclusionAI/humming
vLLM supports:
Humming is a universal, high-performance quantization kernel (similar to the Marlin kernel), but offers several advantages over Marlin:
Benchmark
Service start command
Benchmark command
Benchmark result (TPS)
In SGLang, splitkv_mla and paged_mqa are used for the prefill part of DeepSeek V4, and the attention part takes longer than expected. If fixed, Humming is expected to achieve a greater end-to-end improvement.
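For readers unfamiliar with the MXFP4 format this PR targets, here is a minimal sketch of MXFP4 block quantization under the OCP Microscaling spec (E2M1 4-bit elements sharing one power-of-two scale per 32-value block). It illustrates the format only; it is not the Humming kernel code.

```python
import math

# E2M1 representable magnitudes (1 sign, 2 exponent, 1 mantissa bit)
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize one 32-value block: pick a shared power-of-two scale,
    then round each scaled value to the nearest E2M1 code."""
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * 32
    # Shared scale: a power of two that maps amax near the top
    # E2M1 code (6.0 = 1.5 * 2**2, hence the "- 2").
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    quantized = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)                       # saturate
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to grid
        quantized.append(math.copysign(nearest, v))
    return scale, quantized

def dequantize_mxfp4_block(scale, quantized):
    return [scale * q for q in quantized]
```

A kernel like Humming fuses the dequantization (scale multiply plus E2M1 decode) into the GEMM itself rather than materializing the dequantized weights.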